Introduction to Open Data Science - Course Project

Learning expectations for the course

My background

MSc in biology, currently working at SYKE, ca. 5 years of experience. Main interests: sustainability at the macro scale, national EE-IO analysis, food system sustainability modeling, circular economy.

What I’m looking for by attending

In my projects at SYKE we use R in various capacities on a daily basis. I expect to expand my R skills, and maybe find new tools/packages to use in my work. This will probably be easier when there is some structure to the learning.

Our team also just adopted Git, so it’s nice to get a bit deeper into that as well.

Additionally, the statistical component will most likely benefit me. I wish to expand my repertoire of analytical capabilities overall. At this point, I’m not looking to learn any single particular analysis type (since it’s hard to tell what will be useful in the coming years), so it’s good to have a solid reference base.

Impressions on Exercise 1 and the R for Health Data Science book

Initial impressions suggest this would’ve been a good learning material when I first started dealing with R. The basics are made simple to understand, but the scope seems to go quite a bit deeper than that. There seems to be plenty to learn, and I’m tempted to treat this as reference material in the future too (or as a tutorial for colleagues, if I’m allowed to share it outside the course?)

## [1] "Chapter complete on: Fri Dec  8 19:02:45 2023"

2: Regression and model validation

Describe the work you have done this week and summarize your learning.

Step 1: Data wrangling

1.1: Install required packages

  • To make data retrieval more fluent, we can use the here package to make setting file paths easy.
  • “here” eliminates the need to set precise file paths for data retrieval or saving, as it automatically looks for the designated file relative to its “starting point path”, by default the project root, viewable with the command here().
if (!require(here)) {
  install.packages("here") #Checks if the package is NOT installed. If so, runs the installation.
}
library(here)


if (!require(tidyverse)) {
  install.packages("tidyverse")
}
library(tidyverse)

if (!require(GGally)) {
  install.packages("GGally")
}
library(GGally)
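The three checks above repeat the same pattern, so it can be wrapped in a small helper. This is just a sketch; the function name ensure_pkg is my own, not part of the course material:

```r
# Hypothetical helper wrapping the install-if-missing pattern above
ensure_pkg <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg) # only reached when the package is missing
  }
  library(pkg, character.only = TRUE)
}

ensure_pkg("MASS") # example with a package that ships with R
```

requireNamespace() checks availability without attaching the package, so the check itself has no side effects.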

1.2 Run the script “create_learning2014.R”

  • The required data set, learning2014, is downloaded and edited in a separate script.
  • Package “here” automatically looks for the file in the path specified in here(), which defaults to the current project work directory.

Step 2: Analysis phase

2.1 Graphical summary of the variables

  • First off, we produce a faceted plot of all variables.
ggpairs(Data, mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))

Commentary

  • Considerably skewed gender distribution.
    • Exact values: 110 female, 56 male.
  • No immediately obvious patterns or clustering can be seen in the scatter plots.
    • Except maybe for column 2. In these plots, the majority of data points cluster to the left (towards low values).
  • Multiple statistically significant correlations are evident in the data.
  • Variables “Age” and “surf” appear to correlate negatively. We see slight statistical significance for it as well (“.” = significance threshold 0.1).
  • The distribution of “Age” is strongly skewed towards low values (see the density plot at the Age x Age intersection).
    • Mostly people on the younger side are represented in the data.
  • Attitude and Points correlate positively and show a high significance level.
    • Strongest positive correlation coefficient that the data has.
  • Variables “surf” and “deep” likewise correlate strongly, but in a different direction (negative coefficient)
    • Their correlation is the strongest negative one seen in the data.

2.2 Simple linear regression

  • For this task, we create a model that attempts to explain variation in points by the values of deep learning, attitude and age.
x<-lm(Points ~ deep+Attitude+Age, data = Data) #Create model, assign a name "x" to it
summary(x)
## 
## Call:
## lm(formula = Points ~ deep + Attitude + Age, data = Data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.206  -3.430   0.204   3.979  10.950 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.60773    3.38966   4.605 8.32e-06 ***
## deep        -0.60275    0.75031  -0.803    0.423    
## Attitude     0.35941    0.05696   6.310 2.56e-09 ***
## Age         -0.07716    0.05322  -1.450    0.149    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.307 on 162 degrees of freedom
## Multiple R-squared:  0.2043, Adjusted R-squared:  0.1896 
## F-statistic: 13.87 on 3 and 162 DF,  p-value: 4.305e-08

Commentary

  • Just one variable, attitude, had a statistically significant relationship with Points.
  • Next, we refine the model by removing the 2 non-significant variables.
  • Can these two be purged from the model without suffering a loss in explanatory power?

Model version 2 (removal of the non-significant explanatory variables)

z<-lm(Points ~ Attitude, data = Data) #Create model, assign the name "z" to it
summary(z)
## 
## Call:
## lm(formula = Points ~ Attitude, data = Data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09
  • Fit remains in the same ballpark, ca. 20 %, despite the removal of the two other variables.
    • This single-variable model explains the variation in the data just as well, despite being simpler than the first.
  • The P value indicates the likelihood of obtaining a result like the one seen, assuming that the null hypothesis is true (in our case, h0 would be something like: “No relationship exists between attitude and points”).
    • A low P lends support to the idea that we can reject h0.
    • Here, P is lower than 0.001 -> strong support for rejecting h0.
  • “Estimate” shows the regression coefficients, which indicate how strong the relationship is (simply put, bigger value = steeper slope).
    • When the value of attitude increases by 1 unit, the response variable increases by 0.35.
  • Model fit parameters (R2) indicate a rather poor fit (circa 20 %). R2 indicates how well or badly the fitted model suits the data. A poor fit may result if we attempt to describe a non-linear relationship using a linear model…
    • Attitude explains circa 1/5 of the variation in Points.
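As a worked example, the fitted line can be applied by hand using the coefficients reported in the summary above (values copied as printed):

```r
# Coefficients from the summary output above
b0 <- 11.63715 # intercept
b1 <- 0.35255  # slope for Attitude

predicted_points <- function(attitude) b0 + b1 * attitude

predicted_points(30)                        # ~22.2 predicted exam points for Attitude = 30
predicted_points(31) - predicted_points(30) # a one-unit increase adds exactly b1, 0.35255
```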

Residuals

  • Next, we plot model residuals to assess whether or not the model fulfills prerequisite assumptions of linear regression.
  • Failing to meet these leads to unstable models, biased regression coefficients and in turn unreliable predictive power.
plot(z, which=c(1,2,5))

  • The assumptions we diagnose are:
    • Equality of variances.
      • The method assumes that variance in the data remains equal, despite parameters taking on different values.
      • In stats jargon, data with equal variances is called “homoscedastic”, while heteroscedastic data is the opposite.
      • To diagnose this, we can look at the “Residuals vs Fitted” or “Scale-Location” plots (the latter not shown).
      • Here, if the assumption is met and variances are equal, the data points should be spread more or less evenly along the line.
      • On the other hand, if we saw e.g. a funnel-like shape (points clustering closer together at one end of the X axis than at the other), we could say that the variances were not equal.
      • Here, most of the data points are spread relatively evenly…except for the points on the extreme right of the plot.
        • Points from X-axis tick 26 onward are grouped more tightly together.
        • Variances may not be entirely equal…
        • However, I am tempted to still say they are “good enough”, as only a tiny fraction of the data mass seems to misbehave.
    • Linearity assumption.
      • If we want to use linear regression, the relationship we examine should be linear in form. If it is not, a change of method would be warranted.
      • This can be examined via the “Residuals vs Fitted” plot.
      • In a well-behaved plot, we look for data points spread randomly around the 0-line, in a “shotgun blast” kind of arrangement, with no observable patterns or funnel shapes (see the 1st assumption; this plot can be used for it as well). Such a distribution indicates that the relationship is indeed linear.
      • Here, all seems to be in order, as the points are randomly spread around the zero line.
    • Normality of residuals.
      • Residuals should be normally distributed.
      • Diagnosed via the Q-Q residuals plot.
      • In an ideal case, the data points should follow the line on the plot without deviating.
      • Here, the data seems mostly fine, as only the tails diverge slightly from the line. Not perfect, but good enough? 100 % normality is a bit much to expect from real-world data.
  • Next, we can check if there are any highly influential observations in the data mass.
    • Residuals v. leverage plot highlights cases that have a high influence on model parameters. If we were to remove them, model parameters (coeffs, R2…) would change considerably.
    • Plot shows that observation on row 71 has a high impact.
    • Removing it would have a high impact on model coefficients.
    • In a real-world application, I would take a closer look at these and see whether their values make sense or whether there are mistakes in the data. Contextual knowledge is needed for this.
    • In each such case, a judgement call would have to be made on whether the value should stay or leave.

Assignment 3: Logistic Regression

Install required packages

  • Note that running this procedure requires:
    • tidyverse
    • gt
    • here

Reading source data into R

  • First off, we read the required source data into R.
    • A separate script was created for this.
    • Data frame “alc” contains the necessary data
source(here("Scripts/create_data_assignment3.R"))

Forming initial hypotheses

  • In this exercise, we try to find out if there is a connection between alcohol consumption and other variables.
  • The meanings of each variable are explained under “Additional variable information” here
  • The following 4 are chosen and initial hypotheses for the proposed relationship formed:
    • absences: could high rates of alcohol consumption be affiliated with high rates of absences from school?
    • health: the hypothesis is that people consuming higher doses of alcohol may be in worse health than those who abstain?
    • failures: increased likelihood of course failures among heavy drinkers?
    • famrel: increased likelihood of heavy drinking leading to family conflicts (poorer family relations?)

Summary table for variables, graphical examinations

taulukko<-alc %>% group_by(high_use) %>% summarise(`Average absence` = mean(absences),
                                            `St.Dev, absences.`=sd(absences),
                                            `Average health` = mean(health),
                                            `St.Dev., health`=sd(health),
                                            `Average failure` = mean(failures),
                                            `St.Dev., failures`=sd(failures),
                                            `Average family relations` = mean(famrel),
                                            `St.Dev., family relations` = sd(famrel))
gt(taulukko) %>%
  fmt_number(decimals=2) %>% cols_align("center") %>% opt_stylize(style=1, color="green")  %>% tab_header("Summary table")
Summary table

high_use | Average absence | St.Dev., absences | Average health | St.Dev., health | Average failure | St.Dev., failures | Average family relations | St.Dev., family relations
FALSE    | 3.71            | 4.46              | 3.49           | 1.41            | 0.12            | 0.44              | 4.01                     | 0.89
TRUE     | 6.38            | 7.06              | 3.73           | 1.40            | 0.35            | 0.75              | 3.77                     | 0.93

Absence

  • On average, absence rates seem to be higher in the high alcohol consumption group than in the low consumption group (6.4 versus 3.7). There is also more variation in the high-consuming group.
    • Consistent with initial assumptions.
ggplot(alc, aes(x = high_use, y = absences)) + geom_boxplot() + ggtitle("Distribution of `absences` values by group")

Health

  • Regarding health, differences between the groups seem small.
    • Contrary to expectation, health does not seem to differ between consumption groups?
  • Mean 3.5 versus 3.7, almost the same. SD at 1.4 in both…
ggplot(alc, aes(x = high_use, y = health)) + geom_boxplot()+ggtitle("Distribution of `health` values by group")

Failure rate

  • Failure rate is on average slightly higher in the high consumption group, as was assumed, but the difference is less pronounced than I was expecting.
  • 0 failures seem to be the most common situation regardless of group, any other values are marginally represented in comparison.
Frequency of `failures` values by group

failures | n (high_use = FALSE) | n (high_use = TRUE)
0        | 238                  | 87
1        | 12                   | 12
2        | 8                    | 9
3        | 1                    | 3

Family relations

  • Family relation values appear to be on average slightly poorer in the high alcohol consumption group.
  • This is the sort of finding I expected to see, but the magnitude of the difference between the groups is lower than I thought it would be.

Logistic regression

Defining the GLM model, and extracting the coefficients from summary

# find the model with glm()
m <- glm(high_use ~ failures+absences+health+famrel, data = alc, family = "binomial")

# create model summary and extract the coeffs
summary(m)

coeffs<-m %>% summary() %>% coefficients()
  • All variables except family relations are positively associated with high alcohol use.
  • failures shows the strongest positive association and is also statistically significant.
  • absences likewise has a positive, statistically significant coefficient.
  • Family relations has a negative, statistically significant coefficient. Pretty much as I would expect.
  • Health: no statistical significance seen in this data. This does not necessarily mean that no link between health status and drinking exists, even if this data doesn’t show it, as the body of medical evidence in published literature indicates otherwise. Would a different set of data or a different measured metric have given a different outcome? Could the students self-assess their health status as rosier than it really is, as not all aspects of ill health are immediately obvious to the respondent (e.g. high blood pressure)?
  • Next step: odds ratios

Converting coeffs to odds ratios, calculating confidence intervals

OddsRatio<-exp(coeffs)
ConfInt<- confint(m)

Result<-cbind(OddsRatio, ConfInt)
print(Result)
##              Estimate Std. Error    z value Pr(>|z|)       2.5 %      97.5 %
## (Intercept) 0.4634471   1.790547  0.2670735 1.205335 -1.93186722  0.36139875
## failures    1.8054371   1.219527 19.6267894 1.002916  0.20659753  0.99228511
## absences    1.0848599   1.022825 36.9319701 1.000307  0.03933340  0.12816947
## health      1.1387751   1.092594  4.3383139 1.152858 -0.04154535  0.30645324
## famrel      0.7598363   1.137448  0.1185288 1.033507 -0.52884754 -0.02203822
  • These estimates are on a different scale than in the more familiar linear regression.
    • There, a coefficient can be interpreted directly: when the value of the independent variable increases by 1 unit, the dependent variable changes by the value of the coefficient. Not the case here.
    • These are odds-based and need to be thought of in terms of likelihood, not as a direct change in the dependent variable’s value as the independent’s value changes.
  • (A note on the printout: exp() was applied to the whole coefficient matrix, so the Std. Error, z value and Pr columns are also exponentiated and should be read from summary(m) instead; the intervals from confint(m) are still on the log-odds scale.)
  • Odds ratios quantify the strength of the relationship between dependent and independent vars.
  • If OR = 1, the odds of being a heavy drinker are the same regardless of the value of the independent variable.
  • Here, failures has the strongest link with heavy drinking. An increase of failures by 1 increases the odds of being a heavy drinker almost 2-fold (1.8).
  • For absences and health, the increases in odds are less pronounced.
  • The relationship with famrel is inverted compared to the others: a one-unit increase in famrel multiplies the odds of heavy drinking by about 0.76, i.e. roughly a 24 % decrease. This was pretty much the expected outcome.
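Because odds ratios act multiplicatively, the reported estimates can be combined with simple arithmetic. A sketch using the rounded values from the printout above:

```r
or_failures <- 1.805 # odds ratio for failures, from the output above (rounded)
or_famrel   <- 0.760 # odds ratio for famrel (rounded)

or_failures^2 # two additional failures multiply the odds by ~3.26
1 - or_famrel # each unit of famrel lowers the odds by ~24 %
```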

Predictive power

Refining the model: culling variables that showed no statistical significance

  • Health will be dropped, all others remain
m <- glm(high_use ~ failures+absences+famrel, data = alc, family = "binomial")

Predicting the probability of having the status “high use” based on the model, add as new column to alc frame

  • Here, we take the model from before and, based on it, calculate the probability that a certain student (row in the data) has the attribute “strong drinker”.

  • Then, we use the probabilities to make a prediction of high_use.

  • The model takes actual values from the data and, based on them, estimates how likely it is that a certain row (student) has the attribute (heavy drinker). These predictions may, or may not, match the actual value on each row.

alc$predicted_probabilities <- predict(m, type = "response")  #type determines which outcome is given, see ?predict


alc <- mutate(alc,prediction  = predicted_probabilities > 0.5) #Selects those, for which the model indicates should have more than 50% prob of being a heavy drinker. T if over 50, F if not. 

Comparison of actual values (High use or not) and predicted values. How many mismatches in actual terms, how accurate were we?

Direct comparison by counts and percentages

x <- table(high_use = alc$high_use, prediction = alc$prediction) #2x2 cross-tabulation of actual vs predicted
print(x)
##         prediction
## high_use FALSE TRUE
##    FALSE   242   17
##    TRUE     92   19
y <- prop.table(x) * 100 #The same table as percentages of all cases
print(y)
##         prediction
## high_use     FALSE      TRUE
##    FALSE 65.405405  4.594595
##    TRUE  24.864865  5.135135
  • Rows that were in reality FALSE (as in the value of high_use) were predicted to be FALSE by the model (correctly classified) 242 times (65.4 %).
  • For the TRUE-TRUE case, just 19 observations were recorded (5 %). In this kind of tabulation, it has to be remembered that the groups are not nearly equal in size: there are just 111 TRUE cases and 259 FALSEs.
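From the counts above, the overall accuracy can be computed directly (a sketch; the counts are hard-coded from the cross-tabulation):

```r
# Confusion-matrix counts from the cross-tabulation above
cm <- matrix(c(242, 92, 17, 19), nrow = 2,
             dimnames = list(high_use = c("FALSE", "TRUE"),
                             prediction = c("FALSE", "TRUE")))

sum(diag(cm)) / sum(cm) # overall accuracy: (242 + 19) / 370, ~0.71
```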

Penalty function definition

# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5 
  mean(n_wrong)
}

Idea of the function, what it does and why

  • Task: find the proportion of “light drinkers” that the model misclassified as “heavies”.

  • Set the probability of being a heavy drinker to 0 %.

  • The function takes:

    • the “class” of a student, which can be either 1 or 0, heavy drinker or not,
    • and a fixed probability, and subtracts one from the other.
  • Say that it finds a student with a class of 0.

  • It subtracts the probability (0) from the class value. If class is 0 and prob is 0, the result is also 0, lower than 0.5 (FALSE). This is a correct classification, as we just set the probability of being a heavy drinker to 0.

  • If the result were a positive number, it would have found a heavy drinker (1 - 0 = 1, which is > 0.5, TRUE), despite the prob being 0 (= a misclassified case was found).

  • The converse case: we want to find cases in which the class is 0, even though we set the probability of being a heavy drinker to 100 %.

  • So, if prob is set to 1 (100 % likelihood of being a heavy drinker) and the function finds a class of 1, the calculation looks like abs(1 - 1) = 0, which is not > 0.5. We have a correctly classified case.

  • If it finds a misclassified one, the calculation is abs(0 - 1) = 1 (drop the minus), which is higher than 0.5.

  • In each scenario, the function finally takes the mean of these TRUE/FALSE results, i.e. the proportion of misclassifications.

  • Here, we set the function to find students of class 0 and check if they are mismatched as 1.

loss_func(class = alc$high_use, prob = 0)
## [1] 0.3
  • The proportion of students found to be of class 1, despite the probability of being such set at zero, was 0.3.
  • If the function found only mismatches (i.e. all 1s in the n_wrong object), the mean would be 1. The fewer the mismatches, the lower the value –> a better result regarding accuracy.
loss_func(class = alc$high_use, prob = 1)
## [1] 0.7
  • The opposite case (0s found when the probability was 1) gave 0.7.

  • This kind of constant guess appears to misclassify 0s as 1s more than twice as often as the other way around.
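The behaviour described above can be checked with a toy vector (a sketch; class_toy is made up for illustration, not course data):

```r
# The loss function from the assignment
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

class_toy <- c(1, 1, 0, 0, 0) # two heavy drinkers out of five (hypothetical)

loss_func(class_toy, prob = 0) # 0.4: guessing "nobody drinks" misses the two 1s
loss_func(class_toy, prob = 1) # 0.6: guessing "everybody drinks" misses the three 0s
```

With a constant prob, the loss simply equals the share of the opposite class, which is why the 0.3 and 0.7 above sum to 1.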

Cross validation

# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2972973
  • Compared to the model in the exercise, the predictors we now look at have a somewhat higher mean prediction error, but not overwhelmingly so. The difference is rather small (circa 0.3 versus 0.26).

Further delving into the bonus section if time allows…

Chapter 4: Clustering & Classification

Data wrangling

Not included here, as this time the wrangling is listed as the final assignment and deals with next week’s data. It has been completed; “create_human.R” is available in the repo under Scripts.

Installation of the necessary packages

require(tidyverse);require(here);require(MASS);require(corrplot); require(GGally)

Analysis

Data overview

  • First, we load the Boston dataset and check its dimensions & structure.
data(Boston)

str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
  • Overall the dimensions are 506 x 14.

  • All of the 14 variables appear numerical (integer or double).

  • Next, we look up what each of the 14 vars represents.

  • The dataset has an official documentation.

    • Input command “?Boston” to view.
    • In the documentation, variable definitions are provided.
  • The data contains information on housing in the Boston region, USA.

  • Included variables are:

    • CRIM - per capita crime rate by town
    • ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
    • INDUS - proportion of non-retail business acres per town.
    • CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
    • NOX - nitric oxides concentration (parts per 10 million)
    • RM - average number of rooms per dwelling
    • AGE - proportion of owner-occupied units built prior to 1940
    • DIS - weighted distances to five Boston employment centres
    • RAD - index of accessibility to radial highways
    • TAX - full-value property-tax rate per $10,000
    • PTRATIO - pupil-teacher ratio by town
    • B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
    • LSTAT - % lower status of the population
    • MEDV - Median value of owner-occupied homes in $1000’s
  • Next, graphical and table format summaries are generated for the data

  • First, summary as a table

summary(Boston)
  • The summary values for each var are wildly different: they are all on different scales, despite all being numeric. This is why standardization over the vars will soon be done, to make them comparable.
ggpairs(Boston)

  • There are quite a lot of options on what to look at here. I’m going to cherry-pick some findings instead of going through every variable.

  • Multiple variables have skewed distributions. For example:

    • Age skews strongly towards high values -> mainly old buildings (built before 1940)
    • Variable dis (distance to employment centers) has a skew towards low values (distance to employment centers commonly small)
  • Also, several variables have bimodality (= more than 1 peak in the distribution, meaning that 2 values are more common than others)

  • Scatterplot Indus x NOx appears to show 2 distinct groupings? For most data points, both NOX concentrations and the share of industrial acreage are low (these have a strong, statistically significant correlation coeff too!)

  • Crime rate appears to have a statistically significant correlation with almost all of these vars. Seems to correlate positively with the proportion of industrial acreage, NOX concentrations, large residential land areas…

  • The distribution of “indus” (business acreage) shows bimodality: we have two peaks, indicating that a couple of values are considerably more common than others. This variable also correlates with high statistical significance with NOX concentrations, which makes sense, as the variable represents the prevalence of business acreage like industry.

  • Distribution of NOX is strongly skewed towards small values.

  • Age skews strongly towards high values. Overall, most construction in the regions of the data was done prior to 1940.

  • Again, we see bimodality in property tax rates. Low and high ends of the spectrum have clear peaks.

Dataset standardization

  • As explained above, all these variables are numeric but have wildly different measurement scales. Hence, standardization.
  • Let’s print summaries of both standardized and non-standardized data and compare
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
summary(Boston)
  • This procedure changed the scales on which all the different vars are measured. Previously, they were all different as they describe very different things. Now, we have “forced” all of them to a similar scale. See for example how the scale of “black” changes: max was almost 400, way more than for any other var. Due to standardization, it became 0.44. All the vars are now on the same scale.
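The same transformation can be verified by hand for one column: scale() with default arguments subtracts the column mean and divides by the sample standard deviation.

```r
library(MASS) # for the Boston dataset

data(Boston)
z <- scale(Boston)

# Manual standardization of one column should match scale():
manual_crim <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
all.equal(as.numeric(z[, "crim"]), manual_crim) # TRUE
```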

Categorical crime variable creation

  • Division into bins according to quantiles; set the result as a new variable in the old frame.
bins <- quantile(boston_scaled$crim) 

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))


boston_scaled <- data.frame(boston_scaled, crime)

boston_scaled$crim<-NULL
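Since the breaks are the quartiles of crim, each of the four categories should hold roughly a quarter of the 506 towns; this can be checked with table():

```r
library(MASS) # for the Boston dataset

data(Boston)
boston_scaled <- as.data.frame(scale(Boston))

# Quartile breaks, then cut into four labelled bins (as in the assignment)
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

table(crime) # four bins of roughly equal size
```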

Dividing the data into a training set and a testing set

n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n,  size = nrow(boston_scaled) * 0.8)

# create train set and test set
train <- boston_scaled[ind,] #Randomly selects 80% of rows, index numbers
test <- boston_scaled[-ind,] #everything except the indices in the training set
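A quick sanity check on the split: the two sets should be disjoint and together cover the whole frame. A sketch on the raw Boston data, with a seed added for reproducibility (assumption: the original code did not set one):

```r
library(MASS) # for the Boston dataset

data(Boston)
set.seed(1) # added here so the split is reproducible

n <- nrow(Boston)
ind <- sample(n, size = floor(n * 0.8))
train <- Boston[ind, ]
test  <- Boston[-ind, ]

c(nrow(train), nrow(test))          # 404 and 102 rows
length(intersect(ind, (1:n)[-ind])) # 0: no row ends up in both sets
```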

LDA

  • Fitting LDA on the training data, then plotting it.
  • Here, we seek to find a linear combination of variables that best separates the data into groups.
  • Looking at vector directions & magnitudes tells us which LD each variable has the most effect on.
  • Variables’ relationships with each other can also be seen: parallel vectors indicate correlation, while perpendicular directions are the opposite case.
lda.fit <- lda(crime ~ . , data = train)


lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale =2)

  • RAD has the largest impact on LD1. For LD2, it looks to be either nox or zn, although the magnitudes appear similar…
  • nox and rad, as well as zn and rad, are almost perpendicular to each other (like a 90-degree angle) -> no correlation.
  • Most of the other vars cluster very close together. None of them have vector magnitudes even close to those seen for zn, rad and nox.

Testing model prediction power

  • First, we save the “real” class values of the test set, and remove them from the frame.
  • We will use these to compare how well the model fitted on the training data classifies the actual test data.
correct_classes<-test$crime
test$crime<-NULL


# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results

table(correct = correct_classes, predicted = lda.pred$class)
  • 30 actual “low” cases, only 7 of which were correctly classified. A 23 % success rate…
  • 25 med_lows, of which 12 are correctly classified. A 48 % success rate; better, but still not good enough.
  • 26 med_highs, just 9 correctly placed. 34 %…
  • 21 highs, of which 19 got correctly placed. A 90 % success rate for this category -> high crime rates appear to be predicted very accurately, while in all the other categories the performance is very lackluster.
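The per-class success rates quoted above can be computed from the cross-tabulation counts (hard-coded here from the run described in the text; a fresh run would differ slightly, since the split is random):

```r
# Counts from the cross-tabulation described above
totals  <- c(low = 30, med_low = 25, med_high = 26, high = 21)
correct <- c(low = 7,  med_low = 12, med_high = 9,  high = 19)

correct / totals           # per-class success rates
sum(correct) / sum(totals) # overall accuracy, ~0.46
```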

K-Means

Clearing the environment, reloading the raw Boston data and scaling it

data(Boston)

Boston_scaled<-as.data.frame(scale(Boston))

Calculating distances between data points

dist_eu <- dist(Boston_scaled)

Running the K-means clustering algorithm on the scaled data. Here, we use 6 centers; optimization will follow.

# k-means clustering
km <- kmeans(Boston_scaled, centers = 6) #centers must be a number, not a character string
pairs(Boston_scaled[1:6], col=km$cluster)

Quite a confusing plot… a K of 6 is arguably too many.

Optimizing K

  • In the optimization, we look at the WCSS (Within group sum of squares)
  • A high WCSS indicates that even though we have classified data into a same cluster group, there would still exist considerable variation within that group.
  • We want to find a K value, for which the variation within each group is as small as possible (i.e. the clusters should be formed with the idea of grouping similar data points together, with as little “wild” variation between points in the same group as possible)
  • Visually, we check the plotted curve, and observe at which point (K value) the most “benefit” has been gained from increasing K.
  • The slope of the WCSS curve is steep and decreases rapidly up to a K of circa 2. After that, the slope grows considerably milder -> even if we increase K, the resulting “gains” (or here, losses, as we want small WCSS values) grow less and less pronounced.
    • So, after K = circa 2, it is not really worth it to add any more centers…
# Work with the exercise in this chunk, step-by-step. Fix the R code!
# MASS, ggplot2 and Boston dataset are available
set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(Boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Re-running analysis with K set to 2

km <- kmeans(Boston_scaled, centers = 2)

pairs(Boston_scaled[1:5], col = km$cluster)

pairs(Boston_scaled[6:10], col = km$cluster)

  • With K at 2, it’s a lot easier to see what’s happening.
  • Let’s zoom into some of these:
  • Multiple distinct groupings. For example:
    • Black data points in the tax - age pairplot: cluster of properties high in both tax rate and age
pairs(Boston_scaled[c(1,10)], col = km$cluster)

  • Crime rates and high property tax values (wealth indicator?) appear to cluster separately. Wealthier neighbourhoods, less crime?
pairs(Boston_scaled[c(3,5)], col = km$cluster)

  • Industrial zoning and nitrogen oxide concentrations cluster as well. A distinct grouping of black data points shows high NOx where the rate of industrial zoning is also high, and where zoning is not industrial, NOx values are also low.
pairs(Boston_scaled[c(7,14)], col = km$cluster)

  • Housing value and age appear to show a large cluster in black, with high age (old housing) and low value.
pairs(Boston_scaled[c(1,14)], col = km$cluster)

Bonus task

  • Load original data, standardize, and run K means with 3 clusters
data("Boston")

Boston_scaled<-as.data.frame(scale(Boston))


km <- kmeans(Boston_scaled, centers = 3)
  • Then, fit LDA with clusters as the target (dependent)
lda.fit <- lda(km$cluster ~ . , data = Boston_scaled)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  graphics::arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(km$cluster)

# plot the lda results (select both lines and execute them at the same time!)
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale =2.4)

#More zoom
plot(lda.fit, dimen = 2)
lda.arrows(lda.fit, myscale =20)

  • Now, location on the riverside looks to be the main determinant for group 1. For group 3, it appears to be accessibility to radial highways. The remaining variables cluster together in a single mass, with none standing out individually.
  • In the more zoomed plot, lstat looks to be important for group 3, while zn is important for group 2.

Super Bonus

  • Install plotly
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.3
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout

Calculate matrix products

lda.fit <- lda(crime ~ . , data = train)

model_predictors <- dplyr::select(train, -crime)

dim(model_predictors)
dim(lda.fit$scaling)

matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
  • 3D Plot

  • Colored by the crime classes of the training data, we get the pink “High crime” data points grouped on their own (some overlap with med-high, though)

plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color= train$crime)

Data wrangling for next week

  • This has been completed; “create_human.R” is available in the repo under Scripts and its output under Data.

Chapter 5: Dimensionality Reduction

Data wrangling continues

require("here");require("tidyverse");require("FactoMineR")
## Loading required package: FactoMineR
## Warning: package 'FactoMineR' was built under R version 4.2.3
#source(here("Scripts/create_human.R"))
rm(list = intersect(c("gii", "hd"), ls()))

Task 1: Explore the structure and the dimensions of the ‘human’ data and describe the dataset briefly, assuming the reader has no previous knowledge of it (this is now close to the reality, since you have named the variables yourself). (1 point)

  • Load the previously created human.csv dataset
  • This has been created by merging Human Development Index and Gender Inequality Index data.
library(readr)
human <- read_csv("Data/human.csv")
## New names:
## Rows: 195 Columns: 20
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): Country dbl (19): ...1, HDI_Rank, HDI, LifeExp, EduExp, EduMean, GNI,
## GNI_minus_HDIr...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
human<-select(human, 2:length(human))
  • Metadata for the human dataset can be found at: https://hdr.undp.org/data-center/documentation-and-downloads. See tables 1 and 5.
  • The data contains country specific information on Human Development Index values, and associated variables describing the countries.
  • In the Human Development data, the following variables are included. New, one-word names for these are provided in parentheses and quotes.
    • HDI ranking in relation to the others
    • Country name
    • Human Development Index value (new name: “HDI”)
    • Life expectancy at birth (“LifeExp”)
    • Expected years of schooling (“EduExp”)
    • Mean years of schooling (“EduMean”)
    • Gross national income (GNI) per capita (“GNI”)
    • GNI per capita rank minus HDI rank (“GNI_minus_HDIrank”)
  • Variables from the Gender Inequality Index data are:
    • Country’s ranking in terms of the Gender Inequality Index (“GII_Rank”)
    • Country’s Gender Inequality Index ranking (“GII”)
    • Maternal Mortality Ratio by country (“MMRatio”)
    • Adolescent birth rate, i.e. births per 1,000 women ages 15–19 (“ABRate”)
    • Places in parliament held by women (“PRP”)
    • Population with at least some secondary education, %, for women and men respectively (“PSECED_F” and “PSECED_M”)
    • Labor force participation rate, %, for women and men respectively (“LFPR_F” and “LFPR_M”)
  • Calculated variables are:
    • Ratio of female and male populations with secondary education in each country (“PSECED_FM_Ratio”)
    • Ratio of labor force participation of females and males in each country (“LaborFM_ratio”)
  • Check dimensions and structure:
str(human)
dim(human)    
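For reference, the two calculated ratio variables can be reproduced from the source columns with a simple mutate. The sketch below uses made-up numbers; the column names follow the renaming scheme listed above.

```r
library(dplyr)

# Hypothetical two-country example using the renamed columns:
gii_toy <- data.frame(
  Country  = c("A", "B"),
  PSECED_F = c(50, 90), PSECED_M = c(70, 92),  # secondary education, %
  LFPR_F   = c(40, 60), LFPR_M   = c(80, 65)   # labor force participation, %
)

gii_toy <- gii_toy %>%
  mutate(PSECED_FM_Ratio = PSECED_F / PSECED_M,
         LaborFM_ratio   = LFPR_F / LFPR_M)
```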

Task 2: Exclude unneeded variables: keep only the columns matching the following variable names (described in the meta file above): “Country”, “Edu2.FM”, “Labo.FM”, “Edu.Exp”, “Life.Exp”, “GNI”, “Mat.Mor”, “Ado.Birth”, “Parli.F” (1 point).

# Create a vector of vars to keep. Remember that you renamed your vars differently, copypasting the above names into a vector won't work: 
  
keepers<-c("Country",
  "PSECED_FM_Ratio",
  "LaborFM_ratio",
  "EduExp",
  "LifeExp",
  "GNI",
  "MMRatio",
  "ABRate",
  "PRP")

# Select only these vars from human dataframe
human<-human %>% select(all_of(keepers))

Task 3: Remove all rows with missing values (1 point).

  • Any country (row) with ANYTHING NA will be dropped.
human <- filter(human, complete.cases(human))

Task 4: Remove the observations which relate to regions instead of countries. (1 point)

  • The last seven rows are regions instead of countries; dropping them.
human<-human %>% slice(1:(nrow(human)-7))
dim(human)
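Dropping rows by position works, but it silently breaks if the row order ever changes. A more defensive variant filters the aggregate rows by name; the region labels below are illustrative placeholders, not checked against the actual file.

```r
library(dplyr)

# Toy stand-in for the human data (values illustrative):
human_toy <- data.frame(
  Country = c("Norway", "Chad", "World", "Sub-Saharan Africa"),
  GNI     = c(64992, 2085, NA, NA)
)

# Hypothetical list of non-country aggregate rows:
regions <- c("World", "Sub-Saharan Africa")

human_toy <- filter(human_toy, !Country %in% regions)
```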

Task 5 The data should now have 155 observations and 9 variables (including the “Country” variable). Save the human data in your data folder. You can overwrite your old ‘human’ data. (1 point).

dim(human)
  • Dimensions 155 x 9 as requested.
  • Saving to Data, but I will not overwrite, so that the original remains if I need to revert to it.
library(here)
write.csv(human, file=here("Data/human2.csv"))

Analysis phase

Task 1: Move the country names to rownames (see Exercise 5.5). Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)

  • Setting country names as rownames
library(tibble)
human_<-column_to_rownames(human, "Country")
  • Graphical summary

Commentary:

  • Basically all variables have a skewed distribution, most often towards small values,
    • Except for the female–male ratio in the workforce and life expectancy, both of which skew towards high values
  • Bear in mind that these are non-standardized values, measured at different scales and having very different ranges as well.
  • Multiple strong correlations, some of which are highlighted here:
    • Rate of adolescent births correlates negatively with gross national income (very young mothers are more common in less affluent nations)
    • Maternal mortality rates and adolescent births strongly correlate, positive relationship
    • Life expectancy increases with increasing GNI, as expected
    • Expected years of schooling correlates strongly with GNI & life expectancy (positive corr.), and negatively with adolescent births and maternal mortality
  • It looks like there might be potential to divide this set of countries into the “affluent, progressive and educated” and the “less fortunate” subsets, the latter of which is characterized by low GNIs, low education rates, high maternal mortality and high rates of adolescent birth…
library(GGally)
ggpairs(human_, progress = F)

  • Table-form summary
summary(human_)
##  PSECED_FM_Ratio  LaborFM_ratio        EduExp         LifeExp     
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            MMRatio           ABRate            PRP       
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
  • As the values are non-standardized, all of them exhibit very different ranges.
  • Countries clearly have very different conditions. For instance, life expectancy is at worst merely 49 years. A similarly broad range is seen for GNI.
  • If the data is sorted in an ascending order for these vars, a bleak picture is drawn for the “global south”.
head(arrange(human_, LifeExp, GNI), n= 20)
##                                    PSECED_FM_Ratio LaborFM_ratio EduExp LifeExp
## Swaziland                                0.8423077     0.6131285   11.3    49.0
## Lesotho                                  1.1526316     0.8027211   11.1    49.8
## Central African Republic                 0.3782772     0.8531140    7.2    50.7
## Sierra Leone                             0.4608295     0.9521739    8.6    50.9
## Côte d'Ivoire                            0.4651163     0.6437346    8.9    51.5
## Chad                                     0.1717172     0.8080808    7.4    51.6
## Mozambique                               0.2258065     1.0326087    9.3    55.1
## Cameroon                                 0.6103152     0.8307292   10.4    55.5
## Burundi                                  0.6385542     1.0158537   10.1    56.7
## South Africa                             0.9578393     0.7355372   13.6    57.4
## Zimbabwe                                 0.7854839     0.9275362   10.9    57.5
## Mali                                     0.5099338     0.6240786    8.4    58.0
## Uganda                                   0.6835821     0.9570707    9.8    58.5
## Congo (Democratic Republic of the)       0.3950617     0.9658470    9.8    58.7
## Burkina Faso                             0.2812500     0.8566667    7.8    58.7
## Benin                                    0.4185185     0.8633461   11.1    59.6
## Togo                                     0.3995037     0.9913899   12.2    59.7
## Zambia                                   0.5863636     0.8539720   13.5    60.1
## Gambia                                   0.5523810     0.8709288    8.8    60.2
## Afghanistan                              0.1979866     0.1987421    9.3    60.4
##                                      GNI MMRatio ABRate  PRP
## Swaziland                           5542     310   72.0 14.7
## Lesotho                             3306     490   89.4 26.8
## Central African Republic             581     880   98.3 12.5
## Sierra Leone                        1780    1100  100.7 12.4
## Côte d'Ivoire                       3171     720  130.3  9.2
## Chad                                2085     980  152.0 14.9
## Mozambique                          1123     480  137.8 39.6
## Cameroon                            2803     590  115.8 27.1
## Burundi                              758     740   30.3 34.9
## South Africa                       12122     140   50.9 40.7
## Zimbabwe                            1615     470   60.3 35.1
## Mali                                1583     550  175.6  9.5
## Uganda                              1613     360  126.6 35.0
## Congo (Democratic Republic of the)   680     730  135.3  8.2
## Burkina Faso                        1591     400  115.4 13.3
## Benin                               1767     340   90.2  8.4
## Togo                                1228     450   91.5 17.6
## Zambia                              3734     280  125.4 12.7
## Gambia                              1507     430  115.8  9.4
## Afghanistan                         1885     400   86.8 27.6

Task 2: Perform principal component analysis (PCA) on the raw (non-standardized) human data. Show the variability captured by the principal components. Draw a biplot displaying the observations by the first two principal components (PC1 coordinate in x-axis, PC2 coordinate in y-axis), along with arrows representing the original variables. (0-2 points)

  • PCA on un-standardized data
pca_human_nonstandard <- prcomp(human_)
summary(pca_human_nonstandard)
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7    PC8
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000 1.0000
  • In this non-standardized set, PC1 captures practically all of the variance by itself…
  • This is an effect of skipping standardization, as the ranges of the variables are massively different…
biplot(pca_human_nonstandard, choices = 1:2)
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped

Some interpretation. This is asked for in the next task:

  • As the data is not standardized, every variable is measured at a different scale. Only GNI is now highlighted as a contributing variable, as its range is massively wider than that of any other variable, and it thus dominates the analysis.

  • Basically, the data is now reduced into the “very wealthy countries” and “everybody else”.

  • GNI is negatively associated with the 1st PC; its vector points towards small PC1 values (see also the “loadings” table below, which shows essentially the same thing).

    • As GNI increases, PC1 values get smaller.
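The scale effect is easy to reproduce in isolation: give one variable a much larger variance and it will dominate PC1 on its own (synthetic data, unrelated to the human set).

```r
set.seed(1)
d <- data.frame(
  big   = rnorm(100, sd = 1000),  # GNI-like scale
  small = rnorm(100, sd = 1)
)

# Unscaled PCA: PC1 is essentially the 'big' variable alone
round(prcomp(d)$rotation, 3)

# Scaled PCA: both variables contribute comparably
round(prcomp(d, scale. = TRUE)$rotation, 3)
```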
library(loadings)
## Warning: package 'loadings' was built under R version 4.2.3
pca_loading(pca_human_nonstandard)
## Warning in sqrt(1 - PC_loading^2): NaNs produced
## Standard deviations (1, .., p=8):
## [1] 1.854416e+04 1.855219e+02 2.518701e+01 1.145441e+01 3.766241e+00
## [6] 1.565912e+00 1.912052e-01 1.591112e-01
## 
## Rotation (n x k) = (8 x 8):
##                           PC1           PC2           PC3           PC4
## PSECED_FM_Ratio -5.607472e-06  0.0006713951 -3.412027e-05 -2.736326e-04
## LaborFM_ratio    2.331945e-07 -0.0002819357  5.302884e-04 -4.692578e-03
## EduExp          -9.562910e-05  0.0075529759  1.427664e-02 -3.313505e-02
## LifeExp         -2.815823e-04  0.0283150248  1.294971e-02 -6.752684e-02
## GNI             -9.999832e-01 -0.0057723054 -5.156742e-04  4.932889e-05
## MMRatio          5.655734e-03 -0.9916320120  1.260302e-01 -6.100534e-03
## ABRate           1.233961e-03 -0.1255502723 -9.918113e-01  5.301595e-03
## PRP             -5.526460e-05  0.0032317269 -7.398331e-03 -9.971232e-01
##                           PC5           PC6           PC7           PC8
## PSECED_FM_Ratio -0.0022935252  2.180183e-02  6.998623e-01  7.139410e-01
## LaborFM_ratio    0.0022190154  3.264423e-02  7.132267e-01 -7.001533e-01
## EduExp           0.1431180282  9.882477e-01 -3.826887e-02  7.776451e-03
## LifeExp          0.9865644425 -1.453515e-01  5.380452e-03  2.281723e-03
## GNI             -0.0001135863 -2.711698e-05 -8.075191e-07 -1.176762e-06
## MMRatio          0.0266373214  1.695203e-03  1.355518e-04  8.371934e-04
## ABRate           0.0188618600  1.273198e-02 -8.641234e-05 -1.707885e-04
## PRP             -0.0716401914 -2.309896e-02 -2.642548e-03  2.680113e-03

Task 3: Standardize the variables in the human data and repeat the above analysis. Interpret the results of both analysis (with and without standardizing). Are the results different? Why or why not? Include captions (brief descriptions) in your plots where you describe the results by using not just your variable names, but the actual phenomena they relate to. (0-4 points)

  • Standardization, to make the vars comparable. This process, already seen during previous weeks, centers the variables (mean = 0 for all of them) and scales them to unit variance (sd = 1), eliminating the “comparing apples and oranges” situation caused by wildly different variation and ranges in each variable.
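The transformation behind scale() is just the z-score, as a quick self-contained check shows:

```r
x <- c(2, 4, 6, 8)

# scale() subtracts the column mean and divides by the standard deviation:
all.equal(as.numeric(scale(x)), (x - mean(x)) / sd(x))  # TRUE
```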
human__scaled<-scale(human_)
summary(human__scaled)
##  PSECED_FM_Ratio   LaborFM_ratio         EduExp           LifeExp       
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7378   Min.   :-2.7188  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6782   1st Qu.:-0.6425  
##  Median : 0.3503   Median : 0.2316   Median : 0.1140   Median : 0.3056  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.7126   3rd Qu.: 0.6717  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 2.4730   Max.   : 1.4218  
##       GNI             MMRatio            ABRate             PRP         
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
  • Now, let’s re-run the analysis and see what changes

pca_human_standard <- prcomp(human__scaled)
summary(pca_human_standard)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion  0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
##                            PC8
## Standard deviation     0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion  1.00000
  • PC1 is no longer the only dominant component, as it captures just ca. 54 % of the variance.
  • PCs 1 and 2 together are enough to capture most, ca. 70 % cumulatively, of the variation. After this, the “benefit” tapers off: PC3 adds only ca. 9.6 % more.
s<-summary(pca_human_standard)
Pros<-round(100*s$importance[2, ], digits = 2)
Pros<-paste0(names(Pros), " (", Pros, "%)")


biplot(pca_human_standard, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab= Pros[1], ylab=Pros[2])

pca_loading(pca_human_standard)
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation (n x k) = (8 x 8):
##                         PC1         PC2         PC3         PC4        PC5
## PSECED_FM_Ratio -0.35664370  0.03796058 -0.24223089  0.62678110 -0.5983585
## LaborFM_ratio    0.05457785  0.72432726 -0.58428770  0.06199424  0.2625067
## EduExp          -0.42766720  0.13940571 -0.07340270 -0.07020294  0.1659678
## LifeExp         -0.44372240 -0.02530473  0.10991305 -0.05834819  0.1628935
## GNI             -0.35048295  0.05060876 -0.20168779 -0.72727675 -0.4950306
## MMRatio          0.43697098  0.14508727 -0.12522539 -0.25170614 -0.1800657
## ABRate           0.41126010  0.07708468  0.01968243  0.04986763 -0.4672068
## PRP             -0.08438558  0.65136866  0.72506309  0.01396293 -0.1523699
##                         PC6         PC7         PC8
## PSECED_FM_Ratio  0.17713316  0.05773644  0.16459453
## LaborFM_ratio   -0.03500707 -0.22729927 -0.07304568
## EduExp          -0.38606919  0.77962966 -0.05415984
## LifeExp         -0.42242796 -0.43406432  0.62737008
## GNI              0.11120305 -0.13711838 -0.16961173
## MMRatio          0.17370039  0.35380306  0.72193946
## ABRate          -0.76056557 -0.06897064 -0.14335186
## PRP              0.13749772  0.00568387 -0.02306476
  • Interpretation:
  • Way nicer, more nuanced results
  • Maternal mortality and adolescent birth rates are associated with high values of PC1 (positive corr.).
    • The vectors for these are very much aligned, with only a small angle between them -> they correlate positively with each other.
  • Parliament places held by women, as well as the ratio of women in the workforce, sit at a roughly 90° angle (look at the vectors) to both the maternal mortality/adolescent birth pair and its opposite. This indicates that these variables show little correlation with the others. They are associated with high values of PC2.
  • At the same time, PC1 values correlate negatively with GNI, expected years of education, and life expectancy.
  • Seems to describe a cluster of developing economies of the global south, mainly African.
  • Reinforces the idea that maternal mortality rates and adolescent birth rates are often high in countries where GNI, education levels and life expectancy are low.
  • The polar opposite of this grouping are the “global northerners”: industrialized nations characterized by comparatively high GNIs, education rates, and life expectancies.

Task 5: The tea data comes from the FactoMineR package and it is measured with a questionnaire on tea: 300 individuals were asked how they drink tea (18 questions) and what are their product’s perception (12 questions). In addition, some personal details were asked (4 questions).

Explore the data briefly: look at the structure and the dimensions of the data. Use View(tea) to browse its contents, and visualize the data. Use Multiple Correspondence Analysis (MCA) on the tea data (or on just certain columns of the data, it is up to you!). Interpret the results of the MCA and draw at least the variable biplot of the analysis. You can also explore other plotting options for MCA. Comment on the output of the plots. (0-4 points)

library(FactoMineR)
tea <- read.csv("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea.csv", stringsAsFactors = TRUE)
View(tea)
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "+60","15-24",..: 4 5 5 2 5 2 4 4 4 4 ...
##  $ frequency       : Factor w/ 4 levels "+2/day","1 to 2/week",..: 3 3 1 3 1 3 4 2 1 1 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
  • Plenty of factor variables (class vars) describing tea usage.

  • Dims 300 x 36

  • Let’s simplify a bit and select a subset of these

  • Visual summary

# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, all_of(keep_columns))
# visual summary
library(ggplot2)
pivot_longer(tea_time, cols = everything()) %>% 
  ggplot(aes(value)) + facet_wrap("name", scales = "free")+geom_bar()+theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

Multiple Correspondence Analysis

# multiple correspondence analysis
library(FactoMineR)
mca <- MCA(tea_time, graph = F)

summary(mca)
  • Plotting
plot(mca, invisible=c("ind"), graph.type = "classic")

  • “tea shop” and “unpackaged” have the highest effect on dimension 1
  • The categories “other”, “chain store+tea shop” and “tea bag+unpackaged” have the highest effect on dimension 2
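Such readings can be backed with numbers instead of eyeballing the biplot: FactoMineR’s MCA result exposes the eigenvalues and the per-category contributions. A sketch on a small synthetic factor table (not the tea data):

```r
library(FactoMineR)

# Small synthetic questionnaire-style table:
set.seed(7)
df <- data.frame(
  sugar = factor(sample(c("sugar", "No.sugar"), 50, replace = TRUE)),
  where = factor(sample(c("chain store", "tea shop"), 50, replace = TRUE)),
  lunch = factor(sample(c("lunch", "Not.lunch"), 50, replace = TRUE))
)

mca_toy <- MCA(df, graph = FALSE)

mca_toy$eig          # variance captured per dimension
mca_toy$var$contrib  # each category's contribution (%) to each dimension
```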

Assignment 6: Analysis of longitudinal data

Task 1: Implement the analyses of Chapter 8 of MABS, using the R codes of Exercise Set 6: Meet and Repeat: PART I but using the RATS data (from Chapter 9 and Meet and Repeat: PART II). (0-7 points: 0-4 points for graphs or analysis results + 0-3 points for their interpretations)

  • The analysis workflow follows Exercise Set 6, Part I.
  • The difference is that it is applied to the RATS data instead of the BPRS data.
  1. Load the rats_lng data created in the wrangling script
  • Re-factorize the class variables
library(plotrix)
## Warning: package 'plotrix' was built under R version 4.2.3
library(here)
library(tidyverse)
RATS <- read.csv(here("Data/rats_lng.csv"))
RATS$X <- NULL  # drop the row-index column
RATS$ID <- as.factor(RATS$ID)
RATS$Group <- as.factor(RATS$Group)
  1. Initial graphical summary of the unstandardized data: individual rats’ growth curves
library(ggplot2)
ggplot(RATS, aes(x = Time, y = Weight , linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATS$Weight), max(RATS$Weight)))

  • Rat group 1 has the lowest overall weights, and its individuals group close together
  • Rat group 2 has a potential outlier individual, with a starting weight of ca. 550 at the first measurement.
  1. Standardization and re-plotting
RATS <- RATS %>%
  group_by(Group) %>%
  mutate(stdweight = (Weight-mean(Weight))/sd(Weight)) %>%
  ungroup()
library(ggplot2)
ggplot(RATS, aes(x = Time, y = stdweight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized weight")

  • Standardization makes it easier to compare individual growth curves within a group.
  • In the 1st rat group, there is another, previously hidden outlier whose standardized weight is the lowest in the group and remains consistently low, never approaching the others (the tracking phenomenon in reverse: weight starts low and stays low). A similar “scrawny rat” can be seen in group 3, but that one gains weight later on and catches up with its group.
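As a quick sanity check, group-wise standardization should leave each group with mean ≈ 0 and sd ≈ 1. A minimal sketch of the same dplyr pattern on synthetic stand-in data (not the actual RATS values):

```r
library(dplyr)

set.seed(1)
# synthetic stand-in: two groups measured on very different weight scales
d <- data.frame(
  Group  = rep(c("1", "2"), each = 50),
  Weight = c(rnorm(50, 260, 10), rnorm(50, 500, 40))
)

d <- d %>%
  group_by(Group) %>%
  mutate(stdweight = (Weight - mean(Weight)) / sd(Weight)) %>%
  ungroup()

# each group should now be centered at 0 with unit standard deviation
d %>% group_by(Group) %>% summarise(m = mean(stdweight), s = sd(stdweight))
```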
  1. Aggregate summary graphs
  • Next, the mean weight ± standard error is calculated per group, to show how the treatment groups behave as a whole rather than as individual rats.
library(plotrix)
# Summary data with mean and standard error of weight by group and time 
RATS <- RATS %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(Weight), se = std.error(Weight)) %>%
  ungroup()
## `summarise()` has grouped output by 'Group'. You can override using the
## `.groups` argument.
# Plot the mean profiles
library(ggplot2)
ggplot(RATS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.95,0.45)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")

  • Group 1 is decidedly in its own league, with very different behaviour compared to groups 2 and 3
  • The standard errors of groups 2 and 3 mostly overlap and their overall profiles are very similar. Perhaps no significant difference between the two?
  1. Boxplot summary of rat groups’ mean weights
# reload the raw data (RATS was overwritten by the summary above)
RATS <- read.csv(here("Data/rats_lng.csv"))
RATS$X <- NULL
RATS$ID <- as.factor(RATS$ID)
RATS$Group <- as.factor(RATS$Group)

# mean weight per group at each time point
Rats_mean <- RATS %>%
  group_by(Time, Group) %>%
  summarise(mean = mean(Weight)) %>%
  ungroup()
## `summarise()` has grouped output by 'Time'. You can override using the
## `.groups` argument.
library(ggplot2)
ggplot(Rats_mean, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), time 1-22")

  • Again, G1 stands apart, with a clearly lower mean weight and less variation than the other groups.
  • G2 shows the most within-group variance in weights, while G3 has the highest overall mean weight.
  • No outliers are identifiable.
  1. Statistical testing
  • Here we do not have baseline data (Time = 0); measurements start at t = 1.
  • As there are now 3 groups of rats, a two-sample t-test is not a feasible tool: it can handle 2 groups, but no more.
  • As such, we go straight to ANOVA.
# Fit the linear model with the mean as the response 
fit <- lm(mean ~ Group, data = Rats_mean)
anova(fit)
  • ANOVA detects a statistically significant (p < 0.001) difference in rat weight between the treatment groups.
  • Let’s see if the same holds when the distinctly different G1 is omitted. Do groups 2 and 3 alone differ enough for ANOVA to detect a difference?
Rats_mean_filtered <- filter(Rats_mean, Group %in% c(2, 3))
# Fit the linear model with the mean as the response 
fit <- lm(mean ~ Group, data = Rats_mean_filtered)
anova(fit)

  • Looks like the difference between G2 and G3 is still large enough that a statistically significant difference can be detected.
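An alternative to refitting on subsets is a post-hoc pairwise comparison with multiplicity correction, e.g. Tukey’s HSD on an `aov` fit. A hedged sketch on synthetic three-group data (the variable names `y` and `g` are illustrative, not from the exercise):

```r
# Tukey's HSD gives all pairwise group differences with adjusted p-values
set.seed(1)
d <- data.frame(
  g = factor(rep(c("1", "2", "3"), each = 20)),
  y = c(rnorm(20, 250, 15), rnorm(20, 450, 30), rnorm(20, 520, 30))
)

fit <- aov(y ~ g, data = d)
TukeyHSD(fit)  # pairwise differences: 2-1, 3-1, 3-2
```

This answers “which groups differ?” in one step, instead of one filtered ANOVA per pair.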

Task 2: Implement the analyses of Chapter 9 of MABS, using the R codes of Exercise Set 6: Meet and Repeat: PART II, but using the BPRS data (from Chapter 8 and Meet and Repeat: PART I).

(0-8 points: 0-4 points for graphs or analysis results + 0-4 points for their interpretations)

  1. Load the long-form BPRS data created in the wrangling script
library(readr)
BPRS_lng <- read_csv(here("Data/BPRS_lng.csv"))
## New names:
## Rows: 360 Columns: 6
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): weeks dbl (5): ...1, treatment, subject, pbrs, Week
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
BPRS_lng$...1<-NULL

#Re-factorize the treatment var
BPRS_lng$treatment<-as.factor(BPRS_lng$treatment)
  1. Statistical analysis.
  • First, we fit the BPRS data with a random intercept model. This differs from an ordinary lm() regression in that each test subject is allowed its own regression intercept.
  • Moreover, the model does not assume that the observations (taken repeatedly from the same patients over time) are independent.
  • We seek to explain the BPRS score (column pbrs) with treatment group and time. Subject ID is chosen as the random effect, varying from individual to individual; treatment and week are the fixed effects.
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
# Create a random intercept model
BPRS_RandomInt <- lmer(pbrs ~ Week + treatment + (1 | subject), data = BPRS_lng, REML = FALSE)

# Print the summary of the model

summary(BPRS_RandomInt)

anova(BPRS_RandomInt)
  • Random effects: the subject intercept has a standard deviation of ca. 7 (the residual variance is ca. 104) -> plenty of between-individual variation in regression intercepts

  • AIC (Akaike information criterion) facilitates comparison between models: if model A has a lower AIC than model B, A can be considered to fit the data better. Here, AIC is 2748.

  • Concerning fixed effects, the coefficient for time (Week) is negative, i.e. BPRS scores decrease over time, while the treatment group coefficient is positive.

  • Model w. random intercepts and random slopes

library(lme4)

BPRS_int_coeff<-lmer(pbrs ~ Week + treatment + (Week | subject), data = BPRS_lng, REML = FALSE)

# print a summary of the model

summary(BPRS_int_coeff)

# print the anova table of the model
anova(BPRS_int_coeff)
  • AIC decreases to 2745 -> a better fit than the previous model, in which only the intercepts were allowed to vary between individuals.

  • Considerable increase in the variance of the random effects (strong between-individual variation in both intercepts and slopes).

  • Regarding fixed effects, the coefficient directions appear unchanged.
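Nested lmer fits can also be compared formally with a likelihood-ratio test by passing both models to anova(). A self-contained sketch using lme4’s built-in sleepstudy data as a stand-in for the BPRS fits:

```r
library(lme4)

data(sleepstudy)  # reaction times over days, bundled with lme4

# random intercept only vs. random intercept + random slope
m_int   <- lmer(Reaction ~ Days + (1 | Subject),    data = sleepstudy, REML = FALSE)
m_slope <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy, REML = FALSE)

# chi-square likelihood-ratio test, plus AIC/BIC for both models side by side
anova(m_int, m_slope)
```

The same two-model call applied to the BPRS fits would test whether allowing random slopes improves the fit beyond what the AIC drop already suggests.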

  • Interaction model

  • Next, we add an interaction term: Week × treatment.

library(lme4)

# Week * treatment expands to Week + treatment + Week:treatment
Interaction <- lmer(pbrs ~ Week * treatment + (Week | subject), data = BPRS_lng, REML = FALSE)

# print a summary of the model

summary(Interaction)

# print the anova table of the model
anova(Interaction)
  • AIC decreases even further, indicating that the model fits the data better with the interaction than without it.

  • Final anova on all 3 candidate models

anova(Interaction, BPRS_int_coeff, BPRS_RandomInt)
  • The comparison shows that as the model is developed further, AIC keeps improving. Ultimately the third and final model (with individual variation allowed in both intercepts and slopes, plus the Week × treatment interaction) fits the data best.